Core percolation and onset of complexity in Boolean networks 
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The determination and classification of fixed points of large Boolean networks is addressed in 
terms of constraint satisfaction problem. We develop a general simplification scheme that, removing 
all those variables and functions belonging to trivial logical cascades, returns the computational core 
of the network. The onset of an easy-to-complex regulatory phase is introduced as a function of the 
parameters of the model, identifying both theoretically and algorithmically the relevant regulatory 
variables. 
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Introduction — Boolean Networks (BN) are dynami- 
cal models originally introduced by S. Kauffman in the 
late 60s Q. Since Kauffman's seminal work, BN have 
been used as abstract modeling schemes in many differ- 
ent fields, including cell differentiation, immune response, 
evolution and gene-regulatory networks (for an introduc- 
tory review see Q and references therein) . In recent days, 
BN have received renewed attention as a powerful scheme 
of data analysis and modeling of high-throughput genome 
and proteome experiments [3|- 

The central issue of previous research was the descrip- 
tion and classification of different BN attractor types un- 
der deterministic parallel update dynamics Q, . Spe- 
cial attention was dedicated to so-called critical BN 
situated at the transition between ordered and chaotic 
dynamics. We follow a complementary approach: Our 
main goal here is to study the statistical properties of 
fixed points of the dynamics of general BN. We first refor- 
mulate the original dynamical problem into a constraint 
satisfaction problem, and then we study it with tech- 
niques borrowed from statistical mechanics of disordered 
systems, cf. Q. As a result, we are able to classify dif- 
ferent types of fixed-point organization, and to identify 
the set of relevant regulatory variables. 

The model — In Kauffman networks, the expression 
of one gene is modeled by a Boolean variable: the ex- 
pression level Xi of a gene i takes only the values (i is 
not expressed) or 1 (i is expressed) Having N genes, 
the global expression pattern of a cell is thus given by a 
TV-dimensional vector x = (x\, xn) S {0, 1}^ with bi- 
nary entries. Even if this binary idealization seems to be 
oversimplified with respect to biological reality (including 
mRNA and protein concentrations, post-transcriptional 
and post-translational modifications, etc.), it reflects the 
frequently only qualitative nature of biological knowl- 
edge. Moreover, it is expected that robust genome-wide 
biological processes can be qualitatively understood on 
this level of modeling. In this context, one can hope to 
describe gene regulation via Boolean functions ||: the 
expression level of a regulated gene a is determined by the 



expression levels of the transcription factors ax, aK ■ 

•Ea F a (x ai , . . . , X aK *) (1) 

with a G A C {!,..., N} running over all regulated genes. 




FIG. 1: Example for a small Boolean network, circles sym- 
bolize variables, squares Boolean functions. Xi is an example 
for an external input variable, Xi for a functional variable, 
whereas X3 stands for a regulatory variable, see below. 

The central question of this work is: are methods 
from the statistical mechanics of disordered systems able 
to capture number Aff p and organization of stationary 
points of this network, i.e. vectors x fulfilling simultane- 
ously all equations of type with a G A? The main 
step here is to write them as zero-energy ground states of 
a cost function, or Hamiltonian, defined as 



%a © Fa (pai i ■ ■■ 1 %a K ) 



(2) 



The symbol © stands for the logical XOR operation, 
i.e. each cost term contributes zero to the sum iff Eq. Q 
is fulfilled, and one otherwise. The Hamiltonian 7i(x) 
thus counts the number of unsatisfied Boolean eguations. 
This embeds the problem of finding fixed points into the 
class of constraint- satisfaction problems, which have been 
recently studied extensively from the point of view of sta- 
tistical physics 7], using various techniques from spin- 
glass physics, in particular the cavity method 0. Note 
that a related approach based on topological properties 
of the BN was independently proposed in . 
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In this work we concentrate on the interplay of dif- 
ferent types of functions (represented by F a ) for given 
probability measure of the topology of the network (rep- 
resented by the a, a\, clk )• For the moment we restrict 
the analysis to the case of only K = 2 inputs. Extensions 
will be discussed at the end of this Letter. 

Let us assume that the network is random, i.e. its 
topology is described by a random directed hyper-graph 
where the triples (a, ax, 02) are randomly chosen with the 
following conditions: (i) A function is regulated by at 
most one function, i.e. any two distinct triples (a, a\, 02) 
and (6,61,62) fulfill automatically a ^ b. (ii) There are 
M = aN of these triples. Condition (i) restricts the frac- 
tion a = M/N to the interval [0, 1]. We can distinguish 
three relevant types of variables as displayed in Fig. 
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input 
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FIG. 2: External input variables: There are N — M vari- 
ables which are not regulated by any function. They represent 
external inputs to the network. Regulatory variables are all 
those variables which are regulated and regulate. Functional 
variables: There are e~ 2a N variables which do not regulate 
any other function. 

We now have to specify the functions acting on top 
of the random topology. There are 2 2 = 16 Boolean 
functions, which can be grouped into 4 classes ^lj: (i) 
The two constant functions, (ii) Four functions depend- 
ing only on one of the two inputs, i.e. X\,x±,X2,~x~2- 

(iii) AND-OR class: There are eight functions, which are 
given by the logical AND or OR of the two input vari- 
ables, or of their negations. These functions are canal- 
izing in the sense that, e.g., if F(xi,X2) — x± A X2 and 
the value of x\ is set to zero, the output is fixed to zero 
independently of the value of X2- It is said that x\ is a 
canalizing variable of F with the canalizing value zero. 

(iv) XOR class: The last two functions are the XOR of 
the inputs, and its negation. They are not canalizing: 
Whatever input is changed, the output changes, too. 

Here we are only interested in true K — 2 functions, 
i.e. in the AND-OR class and the XOR class, but the 
extension of our analysis to the whole spectrum of K = 2- 
functions is a minor technical point. The organization of 
fixed points does not depend on the relative appearance 
of different functions within each class, but only on the 
relative appearance of classes. We therefore require xM 
functions to be in the XOR class, and the remaining (1 — 
x)M functions to be of AND-OR type, with < x < 1 
being a free model parameter. In this sense, for K = 2, 
the network ensemble is completely defined by a and x. 

Computational core — Many variables can be fixed 
following logical cascades starting in external variables, 



and the corresponding Boolean equations become ful- 
filled. Other functions can be satisfied trivially because 
they include variables being otherwise unconstrained (see 
below). However, a certain set of equations (depending 
both on the BN and the configuration of external vari- 
ables), cannot be satisfied on the basis of such simple 
local arguments. We define the computational core (CC) 
as the maximal subnetwork formed by these functions. 

The first process to be considered in this context is 
propagation of external regulation (PER) . If both inputs 
of a function are fixed externally, or if a canalizing input 
is fixed to its canalizing value, also the output is directly 
fixed externally. This process can be propagated, exploit- 
ing all direct logical implications of the configuration of 
the external input variables: all the functions which are 
trivially satisfied by this process can be removed. The 
probability that a function survives is 

KPER = 1 - (1 - aTTpER,) 2 - (1 - x)(l - CmpERjCmPER , 

i.e. neither both input variables are fixed by PER nor, 
in the case of a canalizing function, a canalizing variable 
is fixed by PER to its canalizing value. For small a, this 
equation has only the trivial solution ttper — 0, i.e. all 
variables are completely fixed by PER with probability 
one. At cxper{x) = 1/(1 + x), a new solution appears 
continuously. Above this point, a finite fraction of all 
functions survives the PER decimation process, forming 
the PER core. The core percolation point moves toward 
one in the limit x — > 0: in a pure AND-OR network 
an extensive PER core never exists due to the canaliz- 
ing character of the functions. PER is closely related 
to the percolation of information discussed in |2j , which 
studies the parallel evolution of two slightly distinct con- 
figurations. Identifying external variables in our model 
with constant functions in Kauffman's original formula- 
tion, the phase transition line becomes exactly the loca- 
tion of critical BN. The region left to the PER transition 
line corresponds to ordered, while the region on its right 
corresponds to chaotic networks under parallel dynamics 

The second process in determining the core is the leaf- 
removal procedure (LR). A leaf is a variable which has 
degree one, i.e. a variable which is either functional (reg- 
ulated but not regulating) or which is not regulated, and 
acts exactly on one other variable. A regulated leaf cor- 
responds to a Boolean equation which can be trivially 
fulfilled, the same is true if both input variables of a (non- 
constant) function are leaves. In the XOR case, even a 
simple input leaf allows for satisfying the Boolean equa- 
tion independently of the values of the other input and 
the output 12]. None of these cases induces complex be- 
havior based on frustration - meaning in this context that 
the fixed points of the Boolean network are easily deter- 
mined - and can be eliminated from the network. This 
process may induce new leaves, and can be iterated until 
no removable function is left. The survival probability of 
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0.8839 
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FIG. 3: Phase diagram for K = 2: The dotted line gives the 
percolation transition for the PER core, the dashed one for 
the LR core. The dash-dotted curve results from the com- 
bination of LR and PER, and merges in the tricritical point 
(0.785,0.636) with the LR curve. Below this point, the transi- 
tion is continuous, above discontinuous in the core size. The 
full line indicates the RSB transition from an unstructured 
solution set to a highly clustered one. 

a randomly selected Boolean function reads now: 

TTLR = X(l - t out ){l - t ln f + (1 - X){\ - t out ){l - t 2 m ) 

It depends still on the probability that a variable at a 
certain point in this process becomes a regulated (t ou t) 
or regulating (ti n ) leaf, which can be determined self- 
consistently using an iterative argument: 

t out = exp{-2a(l - t out )(l - xt m )} 
Un = (l - ax(l - t ln ) 2 - a(l - x)(l - t? n )) t out ■ 

There is always the solution ttlr = corresponding to 
complete removal of all functions by LR. At some x- 
dependent ollr{x), a second solution appears discontin- 
uously, and the LR core size jumps to a finite fraction of 
the complete network. The line o:lr(x) is monotonously 
growing in x, going from 1/2 for the pure AND-OR case 
to 0.8839 for the pure XOR case. 

Both procedures can be combined in order to deter- 
mine the computational core of the Boolean network un- 
der one external condition: First LR is applied, and then 
PER is needed to see which variables on the LR core are 
fixed via propagation of external regulation. The prob- 
lem can still be solved analytically. Here we give only 
the results, technical details will be presented elsewhere 
[l3| . For x > 0.636, the emergence of the CC is discon- 
tinuous and appears exactly at the point aLR(x). The 
LR core thus contains immediately an extensive compu- 
tational core. The height of the discontinuity decreases, 
however, if we approach x ~ 0.636 from above, and we 
find a tricritical point in (x,a) — (0.636,0,785). Below 



this point, the CC emerges only after the LR percolation 
transition, and the transition is continuous. The results 
are summarized in the phase diagram of Fig. [3J 

Clustering of solutions and complexity — What does 
the existence of an extensive CC imply in the structure 
and organization of fixed points? This questions can be 
answered by applying the zero-temperature cavity method 
of statistical mechanics to the Hamiltonian TL(x) defined 
in Eq. (J2J); technical details will be presented in a longer 
publication |l3| . The first result concerns the solution 
entropy which is found to be 

S =ilog(A/) p ) = (l-a)log(2) 

with the over-bar denoting the average over the random 
network ensemble with fixed a and x (with, in fact, no 
dependence on x). This implies that fixed points satis- 
fying all Boolean functions exist for all a < 1 and all x. 
The line a = 1 can be considered as the SAT/UNSAT 
transition line. The entropy value shows that each config- 
uration of the (1 — a)N external input variables leads on 
average to one stationary point, i.e. the state of all inter- 
nal and functional variables is mainly determined by the 
external conditions, even in the case of a PER core, where 
this fixing is not just a simple logical implication. Even 
if the entropy is analytic in all the phase diagram, the or- 
ganization of the fixed points in configuration space un- 
dergoes a clustering transition at some ctd{x), see Fig. [3] 
Below it, all solutions are collected in one cluster: any 
pair of them can be connected by a path via other solu- 
tions, where in each step only a finite number of variables 
can be changed. The solution space is technically named 
replica symmetric (RS). At the clustering transition, the 
replica symmetry breaks spontaneously (RSB): an expo- 
nential number of macroscopically separated clusters of 
fixed points appears. Their number, or more precisely 
its normalized logarithm, is called complexity. It jumps 
discontinuously at ad(x), as can be seen in Fig. 0] The 
clustered phase sets on in general well inside the region 
where a CC exist. Exceptions are the extreme points 
x = (neither core nor clustering appear at any a < 1) 
and x — 1 (both transition points coincide). 

Beyond K — 2 — The case of Boolean functions de- 
pending on exactly K = 2 can be generalized. However, 
the number of Boolean functions of K variables diverges 
as 2 2 . For K — 3, the 256 functions can still be classi- 
fied completely: there are 14 classes, 4 of them are the 
classes already discussed in the context of K = 2, and 
only 10 lead to actual K = 3 functions. For K > 4, 
non-exhaustive numerical checks were done. The main 
results are: (i) The minimum of the clustering point ad 
over all combinations of functional classes is always given 
by the pure K-XOK class (the only completely analyti- 
cally accessible). We find a d = 0.782, 0.699 for K = 3, 4 
and a d = (InK + In In K + In In In K + ...)/K for K > 1. 
Note that the range of the complex phase becomes larger 
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FIG. 4: Entropy s and complexity £ as a function of a, here 
for x = 1. We display both the analytical result and single 
sample values using belief and survey propagation (BP,SP) 
|9l Il3|. The complexity curves for x < 1 are similar, but the 
discontinuous onset of the clustered phase is shifted to higher 
a d (x). 



A second interesting issue is moving from average case 
analysis to real networks samples. Our method can be 
directly implemented as a message passing algorithm on 
single network instances, and not only on random en- 
sembles as discussed in this work. However, very few 
genome- wide data are so far available. In particular, for 
multi-cellular organisms only small functional modules 
for well-described functions are known, and large-scale 
networks known for yeast 0] and E. coli contain 
only the topological structure, not an extensive descrip- 
tion of the regulatory functions. It is thus highly inter- 
esting to infer gene regulation networks from experimen- 
tally easily accessible high-throughput experiments, as 
e.g. given by genome- wide expression profiles jlfij . This 
inverse problem could be treated with tools similar to the 
ones used in the present analysis. 
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with K. (ii) We conjecture that, for fixed K, the only 
class not leading to clustering is the pure if-AND-OR 
one. This has been checked explicitly for K = 3, and 
non-exhaustively also for K = 4. This class becomes, 
however, both combinatorially and biologically less im- 
portant for higher K, where threshold-like functions are 
expected to be more relevant. 

Conclusion and outlook — We have analyzed the orga- 
nization of fixed points in random Boolean networks iden- 
tifying the sudden emergence of a computational core, 
whose existence is a necessary (but not sufficient) condi- 
tion for a globally complex phase where all fixed points 
are organized in an exponential number of macroscopi- 
cally separated clusters. This phenomenon is found to 
be robust with respect to the choice of the Boolean func- 
tions, and missing only in networks completely made of 
AND and OR functions. In addition, the size of the com- 
plex regulatory phase grows if a higher number K of in- 
puts to the Boolean functions is allowed. 

Open Questions — Our analysis is based on ground 
states of a Hamiltonian counting the number of violated 
functions, not directly related to any overlying biolog- 
ically motivated dynamics. The dynamical accessibility 
of fixed points remains an open challenge which will be 
addressed in a future project. Moreover, noise sources 
present in real cases may force the network to quasi- 
stationary points close to the fixed ones (some regulation 
might not always function properly, without effecting the 
overall state of a cell). In this view, the study of the or- 
ganization and accessibility of meta-stable states in the 
region of low complexity and in the case of fixed external 
inputs, together with their relation with quasi-stationary 
points might be very relevant. 
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